Line Count Benchmark

Shlomi Fish on 2008-01-01T18:20:08

OK, first of all, Happy New (Civil) Year to everybody. Next, I'd like to note that I greatly enjoyed the Israeli 2007 Perl Workshop, which I attended yesterday, and would like to thank all the organisers for making it happen. I posted some notes from topics we discussed at the conference to the mailing list, so you may find them interesting to read. I may post a more thorough report later on.

Now, to the main topic of this post. I was on Freenode's #perl the other day, when we were discussing how to count the number of lines in a file. Someone suggested opening the file, reading it with <$fh>, and counting the lines. Someone else suggested trapping the output of wc -l. Then someone argued that trapping the output of wc -l is non-portable and will cost one a costly fork. But is it slower?

To check, I created a very large text file using the following command:

locate .xml | grep '^/home/shlomi/Backup/Backup/2007/2007-12-07/disk-fs' | \
xargs cat > mega.xml

Here, I located all the files ending with .xml in my backup and concatenated them into a file called "mega.xml". The statistics for this file (lines, words and bytes) are:

$ LC_ALL=C wc mega.xml
195594 1704386 17790746 mega.xml

Then I ran the following benchmark using it:

#!/usr/bin/perl

use strict;
use warnings;

use Benchmark ':hireswallclock';

sub wc_count
{
    my $s = `wc -l mega.xml`;
    $s =~ /^(\d+)/;
    return $1;
}

sub lo_count
{
    open my $in, "<", "mega.xml";
    local $.;
    while(<$in>)
    {
    }
    my $ret = $.;
    close($in);
    return $ret;
}

if (lo_count() != wc_count())
{
    die "Error";
}

timethese(100,
    {
        'wc' => \&wc_count,
        'lo' => \&lo_count,
    }
);

The results?

shlomi:~/Download$ perl ../time-various-line-counts.pl
Benchmark: timing 100 iterations of lo, wc...
lo: 18.0495 wallclock secs (16.72 usr + 1.17 sys = 17.89 CPU) @ 5.59/s (n=100)
wc: 3.70755 wallclock secs ( 0.00 usr 0.03 sys + 1.77 cusr 1.91 csys = 3.71 CPU) @ 3333.33/s (n=100)

The wc method wins and is substantially faster. This is probably because wc is written in optimised C, and so counts the lines faster, despite the cost of the earlier fork.

For small files, the pure-Perl version wins, but for large files wc is better. Naturally, it's not portable, which may be a deal-breaker in some cases.

The lesson of this is that forking processes or calling external programs is sometimes a reasonable thing to do (as MJD has noted before). One way to combine the two approaches is sketched below.
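Something like the following sketch falls back to the pure-Perl loop when wc is unavailable or its output cannot be parsed. (count_lines is just an illustrative name here, and the 2>/dev/null redirection assumes a Unix-like shell and a filename without shell metacharacters.)

sub count_lines
{
    my ($filename) = @_;

    # Try wc first.
    my $out = `wc -l $filename 2>/dev/null`;
    if ($? == 0 && defined($out) && $out =~ /^\s*(\d+)/)
    {
        return $1;
    }

    # Fall back to counting the lines in pure Perl.
    open my $in, "<", $filename
        or die "Cannot open '$filename' - $!";
    local $.;
    while (<$in>)
    {
    }
    my $ret = $.;
    close($in);
    return $ret;
}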


External shell tools

srezic on 2008-01-01T19:56:22

... and it is difficult to get it right. On FreeBSD there is usually some whitespace before the line count, so the regexp has to be changed to /^\s*(\d+)/.
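So a more defensive version of the sub might look like this (just a sketch):

sub wc_count
{
    my $s = `wc -l mega.xml`;
    # allow the leading whitespace that the BSD wc emits
    $s =~ /^\s*(\d+)/
        or die "Unexpected wc output: $s";
    return $1;
}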

But the results on my system (amd64-freebsd) look different: using a text file with nearly 200,000 lines, the wc version makes only about 22 iterations/second, much slower than on your system. The Perl version, on the other hand, seems to be faster than on your system: 9/s.

Invalid result

srezic on 2008-01-01T20:01:05

And now I see the trap: Benchmark.pm seems not to count the CPU time from child processes! So it's not 3333/s for the wc version, but only 26.9/s.
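A rough way around that is to compare wallclock time directly, e.g. with Time::HiRes (just a sketch, assuming the wc_count and lo_count subs from the post are in scope):

use Time::HiRes qw(gettimeofday tv_interval);

for my $pair ([wc => \&wc_count], [lo => \&lo_count])
{
    my ($name, $code) = @$pair;
    my $start = [gettimeofday];
    $code->() for (1 .. 100);
    printf("%s: %.2f wallclock secs\n", $name, tv_interval($start));
}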

Mmmm

Aristotle on 2008-01-02T02:44:53

sub tr_count {
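    # $/ = \(2**19) makes <$in> read fixed 512KB chunks instead of lines,
    # and y/\n// counts the newlines in each chunk; localising $_ keeps the
    # implicit while-loop assignment from clobbering the caller's $_.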
    local ( $/, $_ ) = \( 2**19 );
    my $c = 0;
    open my $in, "<", $file;
    $c += y/\n// while <$in>;
    return $c;
}

Only slightly slower than wc on my machine.